The Annals of Applied Statistics 
2008, Vol. 2, No. 4, 1478-1502 
DOI: 10.1214/08-AOAS188 

© Institute of Mathematical Statistics. 2008 



BAYESIAN MULTINOMIAL REGRESSION WITH CLASS-SPECIFIC 
PREDICTOR SELECTION 1 

By Paul Gustafson and Genevieve Lefebvre 

University of British Columbia and Universite du Quebec a Montreal 

Consider a multinomial regression model where the response, 
which indicates a unit's membership in one of several possible un- 
ordered classes, is associated with a set of predictor variables. Such 
models typically involve a matrix of regression coefficients, with the 
(j, k) element of this matrix modulating the effect of the kth predic- 
tor on the propensity of the unit to belong to the jth class. Thus, 
a supposition that only a subset of the available predictors are as- 
sociated with the response corresponds to some of the columns of 
the coefficient matrix being zero. Under the Bayesian paradigm, the 
subset of predictors which are associated with the response can be 
treated as an unknown parameter, leading to typical Bayesian model 
selection and model averaging procedures. As an alternative, we in- 
vestigate model selection and averaging, whereby a subset of individ- 
ual elements of the coefficient matrix are zero. That is, the subset of 
predictors associated with the propensity to belong to a class varies 
with the class. We refer to this as class-specific predictor selection. 
We argue that such a scheme can be attractive on both conceptual 
and computational grounds. 

1. Introduction. Consider multinomial regression modeling where, for a 
single observational unit, the response variable Y € {0, 1, . . . , c} indicates to 
which of (c+ 1) classes the unit belongs. Correspondingly, X = (X\, . . . , X p )' 
is a vector of p predictor variables. Taking the Y = class as a reference, the 
association between Y and X is modulated by a c x p matrix of coefficients /?, 
with j3jk describing the effect of Xk on membership in the jth class relative 
to the Oth class, for j = 1, . . . , c. Various link functions can be used to relate 
the class probabilities to the linear predictor (3X. 
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As with other forms of regression modeling, there may be a desire or need 
to consider some form of predictor selection, particularly if p is large. The 
standard approach to predictor selection is to consider eliminating some 
predictors entirely, which in the multinomial model context equates with 
setting some columns of (3 to zero. Alternatively, though, one can consider 
the possibility of setting some individual elements of (5 to zero. That is, the 
subset of predictors associated with the propensity of being in class a may 
differ from the subset associated with the propensity of being in class b. 
This paper investigates this possibility and compares it with the standard 
approach. 

To be more specific, we introduce a hierarchical prior distribution under 
which individual elements of (3 can be zero, which we refer to as class- specific 
predictor selection (CSPS). The prior includes a hyperparameter governing 
the tendency for each predictor to be simultaneously irrelevant or relevant 
for multiple classes. A limiting value of this hyperparameter corresponds 
to ordinary predictor selection (OPS), whereby each predictor is entirely 
included or excluded from the model. 

In fact, variants of the CSPS idea have arisen recently in the literature 
in different contexts. In a clustering context, both Friedman and Meulman 
(2004) and Hoff (2006) develop schemes whereby the subset of variables dif- 
ferentiating a given cluster from other clusters can vary with the cluster in 
question. Friedman and Meulman (2004) approach the problem by modify- 
ing existing approaches to distance-based clustering. They make the point 
that discovering structure whereby the subset of relevant variables differs by 
cluster can be important it its own right, in addition to perhaps giving a bet- 
ter clustering of the data overall. Hoff (2006) approaches the problem using 
nonparametric Bayesian technology based on Dirichlet processes, with the 
view that a mean-shift vector is associated to each cluster, with the position 
of the nonzero elements in this vector allowed to vary across clusters. 

Whereas Friedman and Meulman (2004) and Hoff (2006) consider unsu- 
pervised clustering problems, the present paper deals with supervised clas- 
sification problems. In an intermediate vein, Kueck, Carbonetto and de Fre- 
itas (2004) consider a semi-supervised problem, where class labels are not 
observed but there is some structural information about the labels in the 
data. These authors apply a binary regression model to relate predictors 
with membership in the jth class versus membership in any other class. 
They cycle through and apply this procedure to each class, that is, carry 
out (c+ 1) binary regressions all told, letting the subset of relevant predictor 
variables be different each time. This is somewhat similar in flavor to the 
present approach, except that by working with multinomial regression mod- 
els we obtain a real probability distribution over classes for any given value 
of the predictor variables. As well, the present prior specification and com- 
putational scheme bear little relation to Kueck, Carbonetto and de Freitas 
(2004). 
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Perhaps the most related work to ours is Zhou, Wang and Dougherty 
(2006), who do consider a form of CSPS in multinomial-probit models. How- 
ever, the prior structure they use is entirely different from ours. Particularly, 
we use a hierarchically structured prior whereby a hyperparameter controls 
the extent to which CSPS is instantiated relative to OPS. The focus in Zhou, 
Wang and Dougherty (2006) is largely on computational issues surrounding 
scale-up to very large datasets, whereas the present emphasis is more on 
the interpretation and relevance of CSPS schemes. Other related work in- 
cludes Sha et al. (2004) who consider (ordinary) Bayesian variable selection 
for multinomial-probit models. Such schemes tend to be built upon MCMC 
approaches to model selection and averaging in linear models [e.g., Brown, 
Vannucci and Fearn (1998), George and McCulloch (1993), Smith and Kohn 
(1996)]. 

In terms of modeling X given Y in an unsupervised context, Liu et al. 
(2003) and Tadesse, Sha and Vannucci (2005) pursue Bayesian schemes for 
(ordinary) variable selection, and Kim, Tadesse and Vannucci (2006) expand 
on this further using Dirichlet Process technology. Doing variable selection 
when modeling X given Y requires assumptions about how the relevant and 
irrelevant components of X are associated, and, in fact, an independence 
assumption is used in the above-mentioned work. In a semi-supervised con- 
text, Gustafson, Thompson and de Freitas (2007) attempt to model X given 

Y in a way that (i) the relevant components of X can vary with Y, and (ii) 
the irrelevant and relevant components of X (given Y) are not assumed 
to be independent. However, this approach becomes quite cumbersome in 
computational terms. Issues surrounding class-specific predictor selection are 
much more cleanly considered in the present supervised context of modeling 

Y given X. 

The emphasis in what follows is that for many problems CSPS may be 
more scientifically plausible than OPS, as the subset of predictors which 
"characterize" membership in a given class will indeed vary from class to 
class. Thus, we argue for CSPS simply on the grounds of realistic modeling a 
priori. As a curious side benefit though, we find there can be computational 
advantages in implementing CSPS over OPS. Particularly, computational 
schemes for posterior sampling over a space of competing models are gener- 
ally susceptible to poor performance if the models are too far apart. That 
is, a scheme to move from a current model to a nearby model, perhaps while 
keeping other parameters and latent variables fixed, may not work well due 
to large changes in likelihood. In moving from OPS to CSPS we induce a 
much larger and more nuanced space of models, that is, turning a single 
component of (3 on or off is a much smaller change than toggling an entire 
column. This can facilitate the use of MCMC algorithms in computing the 
posterior distribution over the model space. 
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In Section 2 we describe the simple form of the multinomial-probit model 
that we use for investigating CSPS, and then describe our prior distribution 
which includes OPS and varying strengths of CSPS as special cases. Sections 
3 through 6 contain examples of CSPS, while Section 7 gives some concluding 
remarks. An appendix describes implementation via MCMC techniques. 

2. Methods. 

2.1. Multinomial-probit model. Often multinomial models are motivated 
by underlying models for continuous latent variables. Particularly, the ob- 
servable response Y can be thought to arise from an unobservable vector of 
continuous variates Z = (Z\, . . . , Z c )' , according to 

!y, if Z y = max Zj > 0, 
i 
0, if max Zj < 0. 
3 

Roughly interpreted, Zj is the propensity of the unit to belong to the jth 
class, with the biggest positive propensity "winning" the unit. If all the 
propensities are negative, then class zero wins. Note that the asymmetry 
between class zero and the other classes is actually useful in some appli- 
cations, as class zero can be interpreted as those units lacking sufficient 
propensity to belong to any of the classes of interest. 

One common specification arises from taking Z to have a multivariate 
logistic distribution with location vector j3X (and a correlation structure 
arising from differencing independently distributed variates) . This yields the 
well-known multinomial-logistic model, with 

(1) ll { ^ll\x] =ewm '- Pb ' )xl 

where denotes the jth row of (3 (with the convention that /?o» = 0). 

Replacing the multivariate logistic distribution with a multivariate normal 
distribution yields a multinomial-probit model, via 

(2) Z\X~N c (f3X,Z). 

As construed here, (2) has the disadvantage of not leading to a closed-form 
expression for the distribution of Y\X akin to (1), but the advantage of a 
latent variable representation which is more computationally expedient. For 
the sake of simple exposition and computation concerning the predictor se- 
lection problem, in what follows we work with the multinomial-probit model, 
and, in fact, only consider the special case that the variance matrix is set 
to the identity matrix, that is, £ = I c [Zhou, Wang and Dougherty (2006) 
also invoke this simplification]. This specification will be seen to facilitate 
computations greatly, though it also presents several limitations. First, it is 
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a stronger assumption than is needed for identification. In fact, there is a 
considerable literature on identification for multinomial-probit models [see, 
for instance, Bunch (1991), Weeks (1997)]. Typically one can ensure formal 
identifiability by fixing a single diagonal element but letting £ be otherwise 
unknown. However, Keene (1992) emphasizes that "formally identified" does 
not always imply "practically estimable" in the multinomial-probit context. 
Also, Yau, Kohn and Wood (2003) argue that taking £ to be known is com- 
parable in modeling flexibility to the multinomial-logistic model. Second, 
specifications which postulate independent latent variables are sometimes 
criticized on the grounds that an "independence from irrelevant alternatives" 
assumption is not appropriate in some applications [see Train (2003) for dis- 
cussion]. Third, the specification £ = I c necessarily entails some asymmetry 
between class zero and the other classes. A formulation under which Z arises 
from differencing c + 1 independent variates (as in the multinomial- logistic 
model) would remove this asymmetry but would make £ nondiagonal. Thus, 
extension to nondiagonal £ would be a useful future development. 

From the point of view of Bayesian computation, the latent variable for- 
mulation is very useful. In particular, it allows the application of MCMC to 
the joint posterior distribution of latent variables and parameters, which is 
typically more amenable than the posterior distribution of parameters alone 
[see, e.g., the landmark work on Bayesian multinomial models by Albert and 
Chib (1993)]. 

2.2. Predictor selection and prior specification. Starting with (2) as the 
model for the latent vector Z underlying the observed Y, a prior distribution 
is specified for the matrix of coefficients (3, allowing for the possibility that 
some of the elements are zero. Henceforth we make our notation more spe- 
cific in that X comprises p predictors which can vary across units, whereas 
(3 is a cx (p + 1) matrix of coefficients, with columns indexed from zero 
through p, such that the 0th column comprises intercept terms. Commensu- 
rately, let M be a similarly-indexed c x (p+ 1) matrix whose binary elements 
indicate which coefficients are nonzero. A hierarchical prior distribution of 
the form 7r(f3,M,q) = ir(f3\M, q)n(M\q)iv(q) is specified, where q is an un- 
known parameter governing the proportion of coefficients which are active, 
as described further below. For ease of notation the prior for j3 is expressed 
via independent normal distributions, with zero variances for the null coef- 
ficients. That is, given (M,q), the elements of (3 are independently normally 
distributed a priori, with means and variances: 



(3) 



E{(3 jk \M,q} 




if M jk = 0, 
if M jk = 1. 
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Here Mj + = Ylk=o denotes the number of active coefficients for the jth. 
class, while p, = (po, . . . ,p p )' and r 2 are user-specified hyperparameters. 

As a default choice for p, the prior distribution for nonintercept terms 
is centered at zero, as is customary, that is, p\ = ■ ■ ■ = p p = 0. Rather than 
taking po = 0, however, we set 

(4) M = ^>- 1 {l-(c+l)- 1/c }. 

Conceptually, setting each intercept term (3j$ to this value corresponds to 
(Y \Xi = 0, . . . , X p = 0) being uniformly distributed over {0, . . . , c}. Thus, 
this seems an appropriate value at which to center the prior distribution of 
the intercept terms, particularly in problems where the predictors have been 
centered around zero. 

The next stage of the hierarchical prior specification involves a distri- 
bution for M given the unknown parameter q governing the proportion of 
active coefficients and a hyperparameter p which is taken as known. Here 
flexibility is desired, so that at one extreme all the elements of M might 
be assumed independent a priori, which can be thought of as the strongest 
possible sense of CSPS, that is, the activity or inactivity of other elements 
in the kth column has no bearing on Mjk- At the other extreme, taking the 
column elements to be perfectly positively dependent (i.e., either all zero 
or all one) reverts to OPS. One simple route to achieving such flexibility in 
the prior distribution is described below. First though, note that the first 
(0th) column of M is taken fixed as M,o = l c , as removing intercept terms 
is seldom if ever sensible in multinomial regression models. Given param- 
eter q, the remaining columns of M are taken as independent and identi- 
cally distributed. In particular, each column is assigned a mixture distribu- 
tion: with probability (1 — q), the column elements are distributed as i.i.d. 
Bernoulli{(l — p 1 ^ 2 )q}, and with probability q, they are distributed as i.i.d. 
Bernoulli{(l-p 1 / 2 )g + p 1 / 2 }. This prior specification for M given q (and p) 
is interpretable in that Pr(Mj/% = l\q) = q for each (j, k), hence, q is the "pop- 
ulation" proportion of active coefficients. Moreover, Corr(M a fe, M^) = p for 
a 7^ b, so that p governs the tendency for a given predictor to be simulta- 
neously active or inactive across the c classes. In the extreme case of p = 0, 
all elements of M are a priori independent given q. In the extreme case of 
p = 1, each column M.^ is either C or l c , that is, OPS results. 

In what follows we do assign a fixed value to p, suspecting that data are 
not very informative about this quantity. Conversely, we do assign a prior 
distribution to q, namely, q ~ Beta(7i, 72). The hope is to engender some 
learning about q from the data, that is, the data suggest what proportion 
of the coefficients should be active for the problem at hand. 

The specification (3) gives the within-model prior variance of an active 
coefficient as inversely proportional to the total number of active coeffi- 
cients for the class in question. This is a legitimate specification, as the 
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prior distribution on M forces each class to have at least one active coeffi- 
cient (the intercept term). The rationale for this specification is as follows. 
Under the default choice of //, the squared distance \\@j m — /j,\\ 2 is the sum of 
squared coefficients for the nonintercept coefficients for the jth class [plus 
the squared difference between the intercept term and (4)]. At least roughly 
then, \\Pjm — /u|| 2 reflects the strength of association between Zj and X. Now 
specification (3) has the particular feature that 

E{\\P j .- f x\\ 2 \M,q}=r 2 , 

for all M. That is, as M varies, the number of predictors involved varies while 
the strength of the relationship is fixed. This is seen as a desirable property, 
as the interpretation of M is clear, and not confounded with the strength of 
the Z — X relationship. In fact, the literature on Bayesian model averaging 
and selection includes discussion of how the within-model prior distribution 
on parameters should vary across models, and many of the suggested schemes 
involve the width of prior distributions for active coefficients decreasing with 
the number of active coefficients. See, for instance, Fernandez, Ley and Steel 
(2001) for discussion. 

We emphasize that our prior structure is entirely different than the non- 
hierarchical prior employed by Zhou, Wang and Dougherty (2006). They 
consider only the two possibilities corresponding to p = and p = 1 in our 
framework, and they treat the prior probability that a coefficient is active as 
fixed and known, rather than estimable. They also employ a different prior 
specification for the magnitude of active coefficients. 

It should also be stressed that M must be thought of somewhat differ- 
ently under CSPS than under OPS, particularly in relation to the class 
probabilities. Under OPS, setting the kth column of M to implies the 
class probabilities do not depend on X^. Conversely, setting Mj^ = un- 
der CSPS implies that the latent variable Zj does not depend on X^, but 
necessarily Pt(Y = j\X) will still depend on X^ to some extent, via the 
requirement that the class probabilities be normalized. It is hard then to 
divorce the concept of CSPS from the underlying latent variable view of 
multinomial regression. Thus, we interpret Mj^ as governing whether the 
jth class propensity depends on X^, rather than interpreting it in terms of 
class probabilities. 

To elaborate slightly on this issue, admittedly the CSPS interpretation 
is somewhat hampered by using a probit model rather than a logit model. 
As pointed out by a reviewer, in a multinomial-logistic model of form (1), 
having M a k = = implies that the distribution of (Y\Y 6 {a, b}, X) does 
not depend on Xk, so that some direct interpretation of the CSPS structure 
in terms of class probabilities is possible. However, it is straightforward to 
verify that this property does not hold in the multinomial-probit case, hence, 
we do tradeoff some ease of interpretation for ease of computation. 
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To summarize, specification of hyperparameters /z, r 2 , p and 7 leads to 
a completely specified prior distribution. MCMC sampling from the joint 
posterior distribution of (Z^\ . . . , Z^ n \ (3, M, q) can be implemented in a 
relatively straightforward manner. Ideas from Holmes and Held (2006) are 
useful in this regard. In particular, it is possible to marginalize and obtain 
an explicit expression for the joint prior density on (Z,M), with the cor- 
responding posterior being the truncation of this distribution to values of 
Z consistent with the observed values of Y. Thus, the MCMC algorithm 
operates with ir(Z,M,q\D) as the target density. As a post-processing step, 
sampling from Tr((3\Z,M,q,D) is used to complete the posterior sample. 
Further computational details are given in the Appendix. 

2.3. Inference on the matrix of coefficients. Different approaches can be 
taken when estimating the coefficients (3. Let Mjk = E(Mjk\D) be the pos- 
terior probability that the (j, k) class-predictor pair is active, as computed 
via MCMC output. The first estimator of (3jt we consider is simply the 
posterior mean, E{(3jk\D). This is a weighted average of zero (with weight 
1 — Mjfc) and nonzero values (with weight Mj^). On the other hand, if model 
selection rather than model averaging is emphasized, then one might report 
estimates arising from the single model M having highest posterior prob- 
ability. As a further alternative, following Barbieri and Berger (2004), one 
might report inferences arising from the median probability model M*(D), 
which includes exactly those terms having marginal inclusion probabilities 
(i.e., Mjk) greater than 0.5. Since Barbieri and Berger (2004) present a case 
for good prediction arising from the median probability model, and since 
finding this model is an easier computational task than finding the highest 
posterior probability model, we consider E((3\M = M*(D),D) as a second 
estimator of (3. Clearly this is a more sparse estimator than the (model- 
averaged) posterior mean, with potentially many components being zero. 

3. Example: synthetic data. Data are generated with c + 1 = 6 classes 
and p = 15 predictors. The distribution of (X\, . . . X15) is taken as multi- 
variate normal with standardized marginals. The first block (X\, . . . ,Xq) is 
taken as equi-correlated with correlation coefficient 0.5. The next block is 
generated as Xj ~ N(0.8Xj^q, 1 — 0.8 2 ), for j = 7, . . . , 12, yielding across- 
block correlations of 0.8 for aligned elements and 0.8 x 0.5 = 0.4 for non- 
aligned elements. The remaining three elements (X13, Xn, Xi$) are taken 
as independent of all other elements. 

In the first instance n = 250 observations are generated when all elements 
of (3 are zero except for intercepts [3jo = $ _1 {1 — (c + l) _1 / c } for j = 1, . . . c 
(the centering value mentioned earlier), and ((3jj,f3jj + i) = (0.75,0.5) for 
j = 1, . . . ,c. Thus, only two coefficients are active for each class, while four 
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predictors are relevant for exactly two classes, two predictors are relevant 
for a single class only, and nine predictors are completely irrelevant. In par- 
ticular, predictors in the second block are completely irrelevant while being 
highly correlated with relevant predictors in the first block. Note that these 
data are generated under a scheme which matches CSPS, that is, the subset 
of predictors relevant for the j'th class is small, and varies with j. 

In the second instance again n = 250 observations are generated, but now 
the true values of (3 are such that 



with different patterns of signs distinguishing the rows of (3 from one another. 
Again the intercept terms are set equal to (4). Also, predictors in the second 
block are again irrelevant but highly correlated with those in the first block. 
This artificial situation, whereby coefficient magnitudes are constant within 
columns, is construed as a situation which is favorable to OPS rather than 



For both datasets posterior sampling is implemented (i) when all predic- 
tors are retained in the multinomial regression model, (ii) when CSPS is 
implemented with p = 0, (iii) when CSPS is implemented with p = 0.7, and 
(iv) with OPS (i.e., the p = 1 limit). Throughout we set r 2 = 4 as a weakly 
informative prior specification for probit model coefficients acting upon stan- 
dardized predictors (recall that a given strength of association translates to 
a smaller coefficient value on the probit scale than on the more familiar 
and interpretable logit scale). We also set (71,72) = (5, 15), to reflect a weak 
prior belief that relatively few predictors are actually active. 

The ability of MCMC algorithms to mix well across the model space 
is invariably a concern in Bayesian model selection problems. As a simple 
diagnostic, two independent MCMC runs of 10 5 iterations after 10 4 burn- 
in iterations, starting from different initial models, are implemented. More 
precisely, the two chains are started at the empty and full models, that is, 
where the class-predictor pairs are either all excluded or all included. To 
conserve storage space, only parameter values at every tenth iteration are 
retained. The posterior probability of each coefficient being active, Mjk = 
E{Mjk\D}, is estimated separately from the two runs, with the two sets of 
estimates showing good agreement (Figure 1). 

To examine the MCMC mixing for M in more detail, we consider switching 
rates as follows. For the thinned posterior sample of Mjk values, the switch 
rate Sjk is simply the proportion of times that the value of Mjk differs 
from its predecessor. Note that the switch rate differs from the Metropolis- 
Hastings acceptance rate because of the thinning, and because a change to 
Mjk is proposed only once per p iterations on average. The switch rate Sjk 
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is plotted against 2Mjk(l — Mjk) m Figure 2, on the grounds that i.i.d. 
sampling of (M\D) would correspond to the identity line on this plot [since 
the probability that two independent Bernoulli(p) draws differ is 2p(l —p)]. 
Hence, the extent to which the plotted points fall below the identity line 
reflects the extent to which the thinned MCMC output mixes less well than 
i.i.d. sampling. Particularly then, the plots reveal that the MCMC mixing 
is worse when p = 1 (OPS) than when p < 1 (CSPS), in line with the earlier 
discussion in Section 1. 

For the first dataset, Figure 3 gives a graphical representation of true 
and estimated (3 values. Generally the results from any form of predictor 
selection (choice of p) and either choice of estimator (posterior mean, or 
posterior mean conditioned on median probability model) seem preferable 
to the results without any predictor selection. This is particularly the case 
under the p = prior specification, where the estimated and true forms of (5 



,p=0 



= 0.7 



p=1 




-i 1 r 

at ai o* s» o.a ia 






Fig. 1. Estimates M based on two independent MCMC runs plotted against each other. 
The two rows of plots correspond to the two scenarios. The three columns of plots corre- 
spond to p — Q, p = 0.7, p = 1. 
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Fig. 2. Empirical switch rates S for thinned MCMC output versus estimated switch 
rates under i.i.d. sampling, 2Mjk(l — Mjk)- The two rows of plots correspond to the two 
scenarios. The three columns of plots correspond to p = 0, p — 0.7, p = l. 



match very well. Note also that while the p = 0.7 and p = 1 fits appear sim- 
ilar, in the latter case the OPS scheme misses the relevance of one predictor 
(Xq) entirely. 

Figure 4 illustrates estimates of (5 from the second dataset. Again the 
estimates arrived at via model selection seem preferable to those based on 
retaining all predictors. Even though the true (3 values are compatible with 
OPS in this setting, the comparison between CSPS and OPS estimates seems 
mixed. In particular, we observe that OPS does not detect the relevance of 

x 5 . 

For the CSPS cases, we remark that while some relevant predictors are 
missed for some classes, sometimes surrogates for these predictors are de- 
tected. Although this is not desirable for interpretation purposes, it may 
help increase classification ability. 
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Fig. 3. Estimation of j3 for the first synthetic dataset. Each panel plots \/3jk\ via a 
greyscale with white-black corresponding to (0, 1.05). The top two panels depict true values 
and values estimated via including all predictors in the model. The middle row of panels 
corresponds to posterior mean estimation of f3 under CSPS (p~0 and p = 0.7) and OPS 
(p—1). The bottom row replaces the posterior mean with the median probability model 
estimates. 



As a sensitivity analysis, we also implement the Bayesian analysis with 
a more diffuse prior on the coefficients, via t 2 = 25. Figure 5 indicates that 
both CSPS and OPS results are somewhat poorer in this case, in keeping 
with the notion that Bayesian model selection procedures can be sensitive to 
the choice of "within-model" prior distributions (and in the present situation 
the true values of coefficients are quite compatible with the narrower choice 
of r 2 = 4). In practice, we suggest performing sensitivity analyses to assess 
the degree of robustness of the results. 

Finally, to examine the generality of the foregoing results (when r 2 = 4) 
across repeated sampling, 40 datasets are simulated under each of the two 
scenarios. For each dataset, the average squared error (ASE) in estimating el- 
ements of (3 by their posterior means is determined, under the four modeling 
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Fig. 4. Estimation of /3 for the second synthetic dataset. Each panel plots \Pjk\ via a 
greyscale, with white-black corresponding to (0,1.90). The layout is as per Figure 3. 



strategies (no predictor selection, CSPS with p = and with p = 0.7, OPS 
with p= 1). Here the averaging of squared-error is across the c x (p+ 1) = 80 
elements of (5. The distribution of ASE across repeated sampling is depicted 
in the boxplots of Figure 6. In Scenario 1, the advantage of some form of 
model selection over no selection is clear, as is the advantage of CSPS over 
OPS. Recall that Scenario 2 is designed as a context where OPS would likely 
do well, since the magnitude of elements of (3 is constant within columns. 
Consequently, CSPS with p = does not perform well relative to the other 
methods. However, the ASE incurred by CSPS with p = 0.7 is only slightly 
worse than that incurred by OPS, perhaps because MCMC sampling for 
CSPS tends to mix better than for OPS. 

4. Example: forensic glass classification. We apply our method to the 
forensic glass dataset which has been used as an example in the classification 
literature [see, e.g., Ripley (1996)]. Each of n = 214 glass fragments belongs 
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Fig. 5. Estimation of (3 under CSPS (p = and p = 0.7) and OPS (p=l) for the first 
and second synthetic datasets with r 2 = 25. Each panel plots \$jk\ via a greyscale, with 
white-black corresponding to (0,1.05) (Scenario 1); (0,1.90) (Scenario 2). The first row 
corresponds to estimation under the posterior mean for Scenario 1 and the second row for 
Scenario 2. 



to one of c + 1 = 6 classes. The zero (reference) class is taken to be window 
float glass, while classes one through five are window nonfloat glass, vehicle 
window glass, containers, tableware and vehicle headlights respectively. The 
frequency distribution of fragments across classes is (70, 76, 17, 13, 9, 29). The 
nine covariates are refractive index and percent by weight of various oxides 
(Na, Mg, Al, Si, K, CA, Ba, Fe). Standardized versions of these are denoted 
byV = (V 1 ,...,V 9 ). 

To obtain a flexible model relating the covariates to the categorical re- 
sponse variable, we form predictors X from covariates V by taking radial 
basis functions of the form 



(5) 



Xk = a k + b k exp 



w 



,* 1 1 2 



2h? 
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CSPS 



Fig. 6. Average squared error in estimating components of (5 (t 2 — A). Each boxplot de- 
picts the ASE for 40 simulated datasets, where ASE for a single dataset involves averaging 
the squared estimation error for all components of (3. From left to right the methods are as 
follows: CSPS (with p — 0), CSPS (with p — 0.7), OPS (p=l) and no predictor selection 
(NOPS), that is, all predictors included. 



for k = 1, . . . ,p. Radial basis functions are commonly used to obtain a flexible 
nonlinear model in terms of the original covariates, with the simple structure 
of a linear predictor for the basis functions [see, e.g., Hastie, Tibshirani and 
Friedman (2001), Section 6.7]. Often the knots v* ) ...,Vp are taken to be 
the observed V values, so that p = n. In the first instance, though, we set 
p = 54, and take the t>f , . . . ,v* to be p randomly selected covariate values. 
Following Figuerido (2003), the bandwidth parameter in (5) is set at h = 
4. The constants and are chosen to standardize Xk, which seems 
reasonable given the use of an exchangeable prior distribution across (5. 

In this example the hyperparameter values r 2 = 25 and 7 = (5, 15) are 
used, with the former specification intended to represent diffuse prior infor- 
mation. (Even though the empirical results in Section 3 were better when 
r 2 = 4, the use of a wider prior seems appropriate without specific subject- 
area knowledge to suggest otherwise.) For each of p = 0, p = 0.7, p = 1, 
two MCMC runs of 50 x 10 4 post burn-in iterations are implemented, with 
every 50th value retained. Comparison of the M estimates from the two 
independent MCMC runs suggests that the sampler mixes adequately over 
the model space (results not shown) , and the remaining results are based on 
pooling the two runs (yielding a thinned posterior sample of size 2 x 10 ). 
The relationship between empirical switch rates Sjk and Mjk also suggests 
reasonable sampler performance. 

For all three analyses (CSPS p = 0, 0.7 and OPS) we observe that the 
posterior distribution over M is not favoring sparse classifiers, in that none 
of the class-predictor pairs have negligible posterior probability of inclusion. 
There is sparsity, however, in the sense that a relatively small number of 
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components have large posterior probability. For instance, for CSPS with 
p = 0, the smallest Mjk is 0.14, yet Mjk exceeds 0.5 for only 9 of 270 class- 
predictor pairs. Larger proportions exceed this threshold for CSPS with 
p = 0.7 (28 out of 270 class-predictor pairs) and OPS (8 out of 54 predictors). 
Coefficient estimates from the three analyses appear in Figure 7. 

To get some idea of the classifier performances, 10-fold cross-validation 
is undertaken. That is, the data are randomly split into ten equal-sized 
blocks, then repeatedly the CSPS and OPS models are fit to the training 
data (all but one block) to predict Y from X for the validation data (the 
held-out block). Note that for each split, the p = 54 radial basis functions 
are formed using covariate values randomly selected from the training data 
only. Point predictions are based on the mode of the predictive distribution 
for (5^|X) (i.e., model averaging rather than selection is used). We obtain 
an overall misclassification rate of 29%, 29% and 30% for CSPS p = 0, CSPS 
p = 0.7 and OPS p = 1, respectively. These misclassification rates for CSPS 




Fig. 7. Estimation of [3 for the forensic glass example. Each panel plots \j3jk\ ma a 
greyscale, with white-black corresponding to (0,2.40). 



MULTINOMIAL CLASS-SPECIFIC PREDICTOR SELECTION 



17 



are competitive with some of the results cited in the literature for these data 
[see, e.g., Figuredo (2003)]. 

We also investigate whether using a larger number of predictors affects the 
performances of the CSPS and OPS algorithms differentially. To do so, we 
take p = 192, so that when performing 10-fold cross-validation, essentially 
all available training set covariate values are used to form radial basis pre- 
dictors. Interestingly, we now obtain an overall misclassification rate of 28%, 
29%, and 42% for CSPS p = 0, CSPS p = 0.7 and OPS p = l, respectively. 
While the predictive performance of CSPS is about the same as when p = 54, 
the performance of OPS clearly deteriorates. Note that the mixing of the 
MCMC chains for OPS is reasonable, so that OPS's decrease in performance 
is directly related to the model selection strategy adopted. 

5. Example: image classification. As a third example we consider a clas- 
sification problem for which data are available from the UCI Machine Learn- 
ing Repository [Asuncion and Newman (2007)]. The n = 210 units are 3x3 
pixel regions of images of outdoor scenes, each of which is manually identified 
as belonging to one of seven classes (brickface, sky, foliage, cement, window, 
path, grass). There are 30 regions per class. The classification task is to iden- 
tify the correct class using p = 18 continuous predictors giving attributes of 
the regions. These predictors involve the position of the region within the 
larger image, color attributes, and various other attributes (e.g., based on 
line detection, edge detection, etc.). A complete description of the predictors 
is available [Asuncion and Newman (2007)]. As in the previous examples, 
we standardize the predictors. Arguably image classification is a problem 
where it makes particular sense that the predictors useful for characteriz- 
ing a given class will vary with this class. This point is also emphasized by 
Kueck, Carbonetto and de Freitas (2004) and Gustafson, Thompson and de 
Freitas (2007). 

The CSPS scheme (with both p = and p = 0.5), the OPS scheme and in- 
ference without predictor selection (NOPS) are applied to these data. Again, 
t 2 = 25 and 7 = (5, 15) are used. Visual inspection of (3 estimates (plots not 
shown) reveals a few class-predictor pairs standing out more strongly under 
CSPS than under OPS or NOPS, particularly when median-posterior-model 
estimates are used. 

To assess the impact of predictor selection on predictive performance, we 
generate 25 random splits of the units into training and validation samples 
of size 105 units each. For each training sample we fit CSPS (p = 0,0.5), 
OPS and NOPS models, and generate predictive distributions for the vali- 
dation sample units. Using point classifications based on the modes of the 
predictive distributions, Figure 8 displays the number of correct validation- 
set predictions for each data split and method. The results of this example 
are particularly interesting in that the predictive ability of the algorithm 
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CSPS. rtld=Q CSPS, rt»=g.5 OPS NOPS 

Fig. 8. Results from 25 random splits of the image classification data. The across-s- 
plit distributions of the number of correct validation set classifications ( out of 105 ) are 
portrayed for CSPS (p = 0,0.5), OPS and NOPS analyses. 

appears to be superior between the extreme cases of p = and p = l. When 
comparing the algorithm for these two values of p, we observe that the dif- 
ferent data splits are in turn favoring p = or p=l- When the algorithm is 
run with p = 0.5, the number of correct classifications often approaches the 
better of the two extreme cases. Per validation sample, CSPS p = 0.5 has 
on average only about 0.6 more correct predictions than CSPS p = and 
OPS (CSPS p = 0: 95.0; CSPS p = 0.5: 95.6; OPS: 94.96; NOPS: 94.64). We 
believe it worthy though to note that CSPS with p = 0.5 has the most stable 
performance across different training- validation splits. 

6. Example: microarray cancer data classification. The problem of de- 
tecting relevant genes for classification purposes from microarray data has re- 
ceived a lot of recent attention. The typical set-up is that among thousands of 
genes one tries to identify those which are relevant for predicting diagnostic 
categories. The fact that the number of observations is typically very small in 
comparison to the number of genes available often makes this task difficult. 
In this final example we apply our algorithm to the well-known Hereditary 
Breast Cancer data [Hedenfalk et al. (2001)], which can be downloaded at 
http : //research . nhgri . nih . gov/microarray/NE JM_Supplement/. This 
dataset consists of 22 observations on 3226 gene expression levels. Each 
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observation (sample) is classified into one of the following three categories: 
BRCA1, BRCA2 and sporadic, for a total of 7, 8 and 7 samples in each 
class, respectively. 

As is often done with such high-dimensional problems, we initially per- 
form a univariate screening to identify a subset of genes that are most likely 
to be related to the three possible outcomes. We first apply a log base-2 
transformation followed by a standardization for each of the 3226 gene in- 
tensity ratios measured. We then implement a univariate CSPS analysis with 
p = and t 2 = 25 for each of the 3226 genes, using the sporadic cancer class 
as reference. For each analysis, we obtain a post burn-in thinned sample for 
M of size 2 x 10 3 (burn-in: 2000, thin: 10 iterations) and calculate the pro- 
portion of times the gene was selected by the algorithm for the nonreference 
classes BRCA1 and BRCA2. A subset of genes is selected for further analy- 
ses in the following way: we retain all genes having posterior probability of 
inclusion greater than 0.5 for either or both classes, for a total of 621 genes. 
We present in Figure 9 a histogram for the difference in gene probability of 
inclusion between the two nonreference classes. This histogram shows that 
most preselected genes are highly relevant for only one of the two classes, 
and therefore hints at the appropriateness of CSPS analysis over OPS. 



s 1 




I 
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o _ 

tM 



r 1 1 r 1 

-1.0 -0.5 a.o Q.5 1.0 

Different* in probability 01 inclusion 

Fig. 9. Histogram for the difference in probability of inclusion for the preselected genes 
(BRCA1 vs BRCA2). Each preselected gene is such that the probability of inclusion is 
greater than 0. 5 for at least one of the two nonreference classes. 
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Many authors have analyzed this dataset. In the Bayesian literature, Ye- 
ung, Bumgarner and Raftery (2005) introduce a Bayesian model averaging 
algorithm where the combination of two (or more) separate logistic regres- 
sions is used as an attempt to approximate a multinomial regression model. 
Their algorithm finds between 13-18 relevant genes and results in 6 misclas- 
sification errors using leave-one-out cross-validation (LOOCV). Using their 
formulation of CSPS and OPS, Zhou, Wang and Dougherty (2006) report 
zero misclassification errors using a truncated version of the data and a 
subset of pre-screened genes. 

We implement LOOCV for CSPS p = 0, 0.5 and OPS with r 2 = 25 on the 
subset of p = 621 genes. We set the prior distribution to q ~ Beta(5,200), 
which represents a prior belief that outcomes can be distinguished using only 
a very small number of genes (on average ~ 15 genes per class). Identifying 
small sets of relevant genes is generally desirable, as this is often a prerequi- 
site for the development of inexpensive diagnostic tests [Yeung, Bumgarner 
and Raftery (2005)]. 

The fact that the algorithm allows only for one independent modification 
to the state of M per iteration (per class) may hinder proper model explo- 
ration when the number of possible predictors is very large. To account for 
this feature of the algorithm, we let the MCMC run for a considerably large 
number of iterations. We consider 2,000,000 iterations (with a burn-in of 
200,000 iterations) and retain every 200th iteration. Based on the results of 
the LOOCV, we obtained the posterior predictive distribution for each of 
the 22 samples under the model averaged estimator. 

All three analyses classify correctly the same 19 of the 22 observations. 
In Table 1 we present the predictive probability on the true class for each 
observation and analysis. Note that CSPS tends to put more predictive 
weight on the correct class than OPS, which is desirable in the present 
scientific context. Moreover, there is considerable uncertainty in classifying 
the 5th sample under OPS, as less than 2% separates the two most likely 
classes. We obtain an average number of active predictors of 16 and 17 
per class for CSPS p = and p = 0.5, respectively, and 17 for OPS. These 
values are very similar to the prior expected values. In order to examine the 
sensitivity of the results to the choice of prior on q, we redo the analyses with 
q ~ Beta(5, 120). In this case we obtain two misclassification errors for CSPS 
(p = and 0.5), four errors for OPS (Table 2) and an average number of 
active genes that similarly reflects the prior specification (25 and 26 per class 
for CSPS p = and p = 0.5, resp., and 26 for OPS). It is not surprising that 
the results are highly influenced by the choice of prior on q, considering the 
very small sample size (n = 22) available. More experimentation with priors 
allowing for a larger proportion of active predictors suggests that when p 
is large the algorithms can become impractically slow unless the prior on q 
encourages sufficient sparsity. 
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Table 1 

Posterior predictive probability for the correct class under LOOGV, for 
CSPS (p = 0,0.5,) and OPS using the model averaging estimator with 
q ~ Beta(5, 200) 



Sample 


CSPS 


CSPS 


OPS 




o = 


p = 0.5 




i 


0.58 


0.56 


0.51 


2 


0.74 


0.73 


0.68 


3 


0.63 


0.58 


0.54 


4 


0.75 


0.75 


0.66 


5 


0.45 


0.42 


0.37 


6 


0.51 


0.50 


0.46 


7 


0.50 


0.52 


0.51 


8 


0.63 


0.60 


0.53 


9 


0.69 


0.82 


0.63 


10 


0.53 


0.52 


0.53 


11 


0.42* 


0.42* 


0.39* 


12 


0.42* 


0.39* 


0.41* 


13 


0.66 


0.65 


0.55 


14 


0.59 


0.58 


0.51 


15 


0.70 


0.69 


0.57 


16 


0.47 


0.46 


0.40 


17 


0.23* 


0.23* 


0.21* 


18 


0.59 


0.56 


0.51 


19 


0.92 


0.92 


0.84 


20 


0.86 


0.86 


0.77 


21 


0.78 


0.73 


0.69 


22 


0.83 


0.86 


0.74 



*Misclassified observation. 



7. Discussion. In general, with regression modeling, particularly when 
the number of predictors is substantial, schemes to encourage sparse solu- 
tions are called for on grounds of interpretability of the solutions, predictive 
generalization and computational ease. Here sparsity might refer to shrink- 
ing regression coefficients toward zero, or forcing some coefficients to be 
exactly zero. In the latter case, one tries to identify a subset of predictors 
that are judged to be irrelevant to the outcome. 

In the case of multinomial response models on (c+ 1) categories, removing 
a predictor from the model corresponds to setting c coefficients to zero. The 
more nuanced strategy of CSPS that we have investigated is to set individual 
coefficients to zero, either without regard for the other coefficients in the 
same column (the p = implementation of CSPS), or with some partial 
regard to other elements in the column (0 < p < 1). This makes conceptual 
sense in many applications where sparse multinomial regression solutions 
are sought. Moreover, at least in the examples we have considered, this 
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Table 2 

Posterior predictive probability for the correct class for CSPS 
(p = 0,0.5) and OPS using the model averaging estimator with 
q ~ Beta(5, 120) 



Sample 


CSPS 


CSPS 


OPS 






p = 0.5 




i 


0.57 


0.56 


0.47 


2 


0.76 


0.74 


0.65 


3 


0.59 


0.63 


0.54 


4 


0.76 


0.75 


0.64 


5 


0.46 


0.44 


0.31* 


6 


0.51 


0.51 


0.43 


7 


0.56 


0.50 


0.51 


8 


0.61 


0.66 


0.53 


9 


0.68 


0.76 


0.62 


10 


0.54 


0.54 


0.51 


11 


0.44 


0.44 


0.41* 


12 


0.42* 


0.43* 


0.40* 


13 


0.64 


0.68 


0.52 


14 


0.58 


0.59 


0.48 


15 


0.72 


0.71 


0.55 


16 


0.48 


0.48 


0.41 


17 


0.22* 


0.23* 


0.19* 


18 


0.59 


0.58 


0.49 


19 


0.92 


0.92 


0.80 


20 


0.87 


0.84 


0.74 


21 


0.77 


0.76 


0.65 


22 


0.82 


0.83 


0.71 



*Misclassified observation. 



conceptual sense tends to translate to estimates and predictions which are 
as good or better than those obtained via the standard approach of OPS. 

We have also seen that to some degree computational ease is a fortunate 
by-product of CSPS over OPS, since it is easier to traverse a more dense 
space in which the constituent models are closer together. Notwithstanding 
this relative computational ease, however, in general, all MCMC schemes 
for sampling models with respect to extremely large model spaces involve 
a high computational burden. Indeed, one school of thought is that scaling 
up "true" posterior sampling over extremely large model spaces is not a re- 
alistic goal, and that MCMC algorithms are better thought of as devices to 
find good models [Hans, Dobra and West (2007)]. In our experience, using 
a C implementation [Lefebvre and Gustafson (2008)] of CSPS, computation 
time and mixing over M is practical (from the perspective of actual poste- 
rior sampling) in problems with hundreds of predictors, but not thousands. 
Some modest further scale-up might be achievable through optimization of 
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the code, but substantial further scale-up would seem to require a fresh 
computational approach. 

APPENDIX 

In what follows we use M[j«] rather than Mj m to denote the jth row of 
M, in order to avoid nested subscripting. Also note that when a double- 
subscript is applied to Z, Z%j denotes the value of Zj for the ith unit, with 
Z 9 j denoting the values of Zj for all units. The posterior distribution over 
all unknown quantities can be expressed as 

ir(q, M, Z, p\D) = ir(q, M, Z\D)tt((3\Z, M). 

The second term n(/3\Z,M) involves independent multivariate normal dis- 
tributions of varying dimension for the rows of (5, as arises from standard 
Bayesian updating in normal linear models with normal priors on regression 
coefficients. Thus, it suffices to implement MCMC with n(q,M,Z\D) as the 
target distribution. The attractive feature of this scheme is that (q,M,Z) is 
of fixed dimension, so that straightforward MCMC updates can be applied. 
Specifically, we can write 

n 

n(q, M, Z\D) oc 7r(g, M, Z) J] I{Z it G %;]} 

i=l 

n 

cx 7r(q)ir(M\q)ir(Z\M) J] I{Zi. G S[ yi }}, 
i=l 

where S[y] comprises the values of Z = (Z±, . . . , Z c ) consistent with Y = y. 
Note that ir(Z\M) takes the form 

c 

(6) 7r(Z\M) = Y[ ^n{Zuj]X M y^ M y u ],I n +X M y m ]V M \j»]X' M y a ]}, 

J'=l 

where (J>mu»] an d ^M[>] denote the prior mean and variance for the active 
regression coefficients pertaining to the jth class. In the case that c = 1, 
Holmes and Held (2006) give an efficient algorithm to sweep through the data 
index and compute the mean and variance (both scalars) of Zi given Z_j 
arising from (6). This trivially extends to c > 1 by applying the algorithm 
directly (i.e., in parallel) to give the mean vector and diagonal variance 
matrix for Zi, given Zi_j\ % . Then for given i it is a simple matter to sweep 
through j and update according to the appropriate truncated normal 
distribution which ensures that Z^ G S[yi). 

To update M, for each j, a new value of M[j»] is proposed by toggling one 
component of the current value. Computation of the requisite Metropolis- 
Hastings acceptance probability is straightforward, again following Holmes 
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and Held (2006). In particular, the relevant full conditional density takes 
the form 



7r(M[j.]\q,M[(-j).],Z,D) 

(7) 



IT/ 1 1/2 1 f> 1-1/2 
« \V M [j.]\ \V M \j.]\ 



x exp{±0 A/ y.j -A'M[j«]) / ^M[i.](^M[i.] - Afcf|>])} 
xvr(M| (? ), 

where p>MUu] an d are the posterior mean and variance of the active 

jth class regression coefficients. That is, V^h-, = ~^ ■ X M\j»] X M\j»] anc ^ 

PM[j.} = ^M[j.}( X ' M [j.] Z 'j + V M\j,]^M[j.\)- 

Note that the prior density for (M\q) is of complicated form, namely, 

<M\q) = f[[(l-q)p (p,q) M+k n-Po(p,q)} C - M+k 
k=l 

+ qpi(p,q) M+k {l-pi(p,q)} c - M+k ], 



where Po{p,q) = (1 — P 1 ^ 2 )q and pi(p,q) = (1 — p 1 ^ 2 )q + /3 1 / 2 . Thus, we sim- 
ply use a random-walk Metropolis-Hastings algorithm to update q given 
(M,Z,D). 

The above algorithm adapts quite readily to the p = 1 case of OPS. The 
update to Z is unchanged. The update to M, which is now a vector, involves 
a product of c terms of the form (7) in the acceptance probability. The 
update to q becomes simpler, since the full conditional distribution of q 
becomes a beta distribution. 

Acknowledgment. We thank the editorial team for many useful sugges- 
tions. 

SUPPLEMENTARY MATERIAL 

C implementation (DOI: 10.1214/08-AOAS188SUPP; .zip). Code to im- 
plement CSPS and OPS appears in a supplementary material file posted at 
the journal website. 
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